discrete action
- North America > United States > Illinois > Champaign County > Champaign (0.04)
- North America > United States > Arizona > Maricopa County > Tempe (0.04)
Grounding Multimodal Large Language Models in Actions
Multimodal Large Language Models (MLLMs) have demonstrated a wide range of capabilities across many domains, including Embodied AI. In this work, we study how to best ground an MLLM into different embodiments and their associated action spaces, including both continuous and discrete actions. For continuous actions, a set of learned tokenizations that capture an action at various resolutions allows for sufficient modeling precision, yielding the best performance on downstream tasks. For discrete actions, semantically aligning these actions with the native output token space of the MLLM leads to the strongest performance. We arrive at these lessons via a thorough study of seven action grounding approaches on five different environments, encompassing over 114 embodied tasks.
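The continuous-action lesson is easiest to picture as a coarse-to-fine tokenizer. Below is a minimal sketch of one plausible reading of "learned tokenizations that capture an action at various resolutions", using residual-style quantization; the random codebooks, dimensions, and function names are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: residual-style coarse-to-fine tokenization of a
# continuous action vector. Codebooks are random stand-ins for learned ones;
# all names and sizes are illustrative, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)
ACTION_DIM = 7          # e.g. 6-DOF end-effector delta + gripper
CODEBOOK_SIZE = 256     # tokens available per resolution level
NUM_LEVELS = 3          # coarse -> fine

codebooks = [rng.normal(scale=1.0 / (2 ** level), size=(CODEBOOK_SIZE, ACTION_DIM))
             for level in range(NUM_LEVELS)]

def tokenize(action: np.ndarray) -> list[int]:
    """Encode a continuous action as one token per resolution level."""
    residual, tokens = action.copy(), []
    for codebook in codebooks:
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        tokens.append(idx)
        residual = residual - codebook[idx]   # finer levels model what is left
    return tokens

def detokenize(tokens: list[int]) -> np.ndarray:
    """Decode tokens back into a continuous action (sum of the level codes)."""
    return sum(codebook[idx] for codebook, idx in zip(codebooks, tokens))

action = rng.uniform(-1, 1, size=ACTION_DIM)
tokens = tokenize(action)
print(tokens, np.abs(detokenize(tokens) - action).max())  # error shrinks with more levels
```

Each added level refines the reconstruction, which is one way "various resolutions" can buy modeling precision without a huge single codebook.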
EverydayVLA: A Vision-Language-Action Model for Affordable Robotic Manipulation
Chopra, Samarth, McMoil, Alex, Carnovale, Ben, Sokolson, Evan, Kubendran, Rajkumar, Dickerson, Samuel
While Vision-Language-Action (VLA) models map visual inputs and language instructions directly to robot actions, they often rely on costly hardware and struggle in novel or cluttered scenes. We introduce EverydayVLA, a 6-DOF manipulator that can be assembled for $300 and supports modest payloads and workspaces. A single unified model jointly outputs discrete and continuous actions, and our adaptive-horizon ensembler monitors motion uncertainty to trigger on-the-fly replanning for safe, reliable operation. On LIBERO, EverydayVLA matches state-of-the-art success rates, and in real-world tests it outperforms prior methods by 49% in-distribution and 34.9% out-of-distribution. By combining a state-of-the-art VLA with cost-effective hardware, EverydayVLA democratizes access to a robotic foundation model and paves the way for economical use in homes and research labs alike.
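The adaptive-horizon ensembler is described only at a high level. The sketch below shows one plausible mechanism, assuming disagreement across an ensemble of predicted action chunks serves as the uncertainty signal; the threshold, shapes, and function names are our own illustrative choices, not EverydayVLA's interface.

```python
# Hypothetical sketch of an "adaptive-horizon ensembler": execute a predicted
# action chunk only while the ensemble's per-step disagreement stays low,
# then replan. All parameters are illustrative assumptions.
import numpy as np

def executable_prefix(action_chunks: np.ndarray, max_std: float = 0.05) -> int:
    """action_chunks: (n_ensemble, horizon, action_dim) predictions for the same
    observation. Return how many leading steps the ensemble agrees on; the robot
    would execute only that prefix before replanning."""
    per_step_std = action_chunks.std(axis=0).max(axis=-1)      # (horizon,)
    unsafe = np.nonzero(per_step_std > max_std)[0]
    horizon = action_chunks.shape[1]
    return int(unsafe[0]) if unsafe.size else horizon

# Synthetic demo: predictions agree early in the chunk and diverge later,
# so the executable horizon shrinks and replanning is triggered sooner.
rng = np.random.default_rng(0)
base = rng.uniform(-1, 1, size=(12, 7))                         # (horizon, action_dim)
noise = rng.normal(scale=np.linspace(0.0, 0.2, 12)[None, :, None], size=(4, 12, 7))
print(executable_prefix(base[None] + noise))                    # only the agreed prefix
```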
EndoVLA: Dual-Phase Vision-Language-Action Model for Autonomous Tracking in Endoscopy
Ng, Chi Kit, Bai, Long, Wang, Guankun, Wang, Yupeng, Gao, Huxin, Yuan, Kun, Jin, Chenhan, Zeng, Tieyong, Ren, Hongliang
In endoscopic procedures, autonomous tracking of abnormal regions and following of circumferential cutting markers can significantly reduce the cognitive burden on endoscopists. However, conventional model-based pipelines are fragile: each component (e.g., detection, motion planning) requires manual tuning and struggles to incorporate high-level endoscopic intent, leading to poor generalization across diverse scenes. Vision-Language-Action (VLA) models, which integrate visual perception, language grounding, and motion planning within an end-to-end framework, offer a promising alternative by semantically adapting to surgeon prompts without manual recalibration. Despite their potential, applying VLA models to robotic endoscopy presents unique challenges due to the complex and dynamic anatomical environments of the gastrointestinal (GI) tract. To address this, we introduce EndoVLA, designed specifically for continuum robots in GI interventions. Given endoscopic images and surgeon-issued tracking prompts, EndoVLA performs three core tasks: (1) polyp tracking, (2) delineation and following of abnormal mucosal regions, and (3) adherence to circular markers during circumferential cutting. To tackle data scarcity and domain shifts, we propose a dual-phase strategy comprising supervised fine-tuning on our EndoVLA-Motion dataset and reinforcement fine-tuning with task-aware rewards. Our approach significantly improves tracking performance in endoscopy and enables zero-shot generalization in diverse scenes and complex sequential tasks.
- North America > United States (0.04)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- Asia > China > Hong Kong (0.04)
- Health & Medicine > Therapeutic Area > Gastroenterology (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
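The "task-aware rewards" used in EndoVLA's reinforcement fine-tuning phase are not spelled out in the abstract. The sketch below illustrates what a reward of this kind could look like for the polyp-tracking task, assuming it scores how well a predicted motion re-centers the target in the image; the signature, scaling, and weights are hypothetical, not the paper's reward definition.

```python
# Hypothetical sketch of a task-aware tracking reward: higher when the predicted
# endoscope motion moves the tracked region toward the image center, with a small
# penalty on motion magnitude. All names and constants are illustrative assumptions.
import numpy as np

def tracking_reward(target_center: np.ndarray,
                    predicted_motion: np.ndarray,
                    image_center: np.ndarray = np.array([0.5, 0.5]),
                    pixels_per_unit: float = 0.1) -> float:
    """target_center: normalized (x, y) of the tracked region before the action.
    predicted_motion: (dx, dy) image-space shift the predicted action induces."""
    new_center = target_center + pixels_per_unit * predicted_motion
    centering_error = np.linalg.norm(new_center - image_center)
    effort_penalty = 0.1 * np.linalg.norm(predicted_motion)   # discourage jerky motion
    return float(np.exp(-4.0 * centering_error) - effort_penalty)

# A motion that re-centers the target scores higher than one that drifts away.
print(tracking_reward(np.array([0.7, 0.4]), np.array([-2.0, 1.0])),
      tracking_reward(np.array([0.7, 0.4]), np.array([2.0, -1.0])))
```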
Real-Time Iteration Scheme for Diffusion Policy
Duan, Yufei, Yin, Hang, Kragic, Danica
Diffusion Policies have demonstrated impressive performance in robotic manipulation tasks. However, their long inference time, caused by an extensive iterative denoising process, and the need to execute an action chunk before the next prediction to maintain consistent actions, limit their applicability to latency-critical tasks or simple tasks with short cycle times. While recent methods have explored distillation or alternative policy structures to accelerate inference, these often demand additional training, which can be resource-intensive for large robotic models. In this paper, we introduce a novel approach inspired by the Real-Time Iteration (RTI) scheme, a method from optimal control that accelerates optimization by using solutions from previous time steps as initial guesses for subsequent iterations. We explore the application of this scheme to diffusion inference and propose a scaling-based method to effectively handle discrete actions, such as grasping, in robotic manipulation. The proposed scheme significantly reduces runtime computational cost without requiring distillation or policy redesign, enabling seamless integration into many pre-trained diffusion-based models, particularly resource-demanding large ones. We also provide theoretical conditions for contractivity, which can be useful for choosing the initial denoising step. Quantitative results from extensive simulation experiments show a substantial reduction in inference time with overall performance comparable to Diffusion Policy using full-step denoising. Our project page with additional resources is available at: https://rti-dp.github.io/.
- Europe > Denmark > Capital Region > Copenhagen (0.04)
- Europe > Sweden > Stockholm > Stockholm (0.04)
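The core RTI idea, reusing the previous control step's solution as the initial guess, can be shown with a toy denoiser. The sketch below captures only the warm-start control flow (re-noise the previous action chunk to an intermediate level, then run a short denoising tail instead of the full schedule); the denoiser, step counts, and names are stand-ins, not the paper's model.

```python
# Hypothetical sketch of RTI-style warm starting for diffusion inference.
# The "denoiser" is a toy contraction toward the clean chunk, used only to
# show the control flow; all constants and names are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(x, clean_chunk):
    """Toy stand-in for one reverse-diffusion update pulling x toward the clean chunk."""
    return x + 0.3 * (clean_chunk - x)

def rti_inference(prev_chunk, clean_chunk, total_steps=50, warm_start_steps=10):
    """Warm start: partially re-noise the previous solution to an intermediate level,
    then run only `warm_start_steps` denoising updates instead of `total_steps`."""
    noise_scale = warm_start_steps / total_steps
    x = prev_chunk + noise_scale * rng.normal(size=prev_chunk.shape)
    for _ in range(warm_start_steps):
        x = denoise_step(x, clean_chunk)
    return x

prev = np.zeros((8, 7))                  # action chunk from the previous control step
goal = 0.3 * np.ones((8, 7))             # what full-step denoising would produce now
print(np.abs(rti_inference(prev, goal) - goal).mean())   # small error with ~5x fewer steps
```

The contractivity conditions mentioned in the abstract would govern how few tail steps suffice; here the toy denoiser contracts by a fixed factor per step, which is an assumption of the sketch.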
Prime the search: Using large language models for guiding geometric task and motion planning by warm-starting tree search
Lee, Dongryung, Joo, Sejune, Lee, Kimin, Kim, Beomjoon
The problem of relocating a set of objects to designated areas amidst movable obstacles can be framed as a Geometric Task and Motion Planning (G-TAMP) problem, a subclass of task and motion planning (TAMP). Traditional approaches to G-TAMP have relied either on domain-independent heuristics or on learning from planning experience to guide the search, both of which typically demand significant computational resources or data. In contrast, humans often use common sense to intuitively decide which objects to manipulate in G-TAMP problems. Inspired by this, we propose leveraging Large Language Models (LLMs), which have common sense knowledge acquired from internet-scale data, to guide task planning in G-TAMP problems. To enable LLMs to perform geometric reasoning, we design a predicate-based prompt that encodes geometric information derived from a motion planning algorithm. We then query the LLM to generate a task plan, which is used to search for a feasible set of continuous parameters. Since LLMs are prone to mistakes, instead of committing to the LLM's outputs, we extend Monte Carlo Tree Search (MCTS) to a hybrid action space and use the LLM to guide the search. Unlike the previous approach that calls an LLM at every node and incurs high computational costs, we use the LLM only to warm-start the MCTS with the nodes explored in completing the LLM's task plan. On six different G-TAMP problems, we show our method outperforms previous LLM planners and pure search algorithms. Code can be found at: https://github.com/iMSquared/prime-the-search
- Asia > South Korea > Seoul > Seoul (0.05)
- North America > United States > District of Columbia > Washington (0.05)
- North America > Canada > Quebec > Montreal (0.04)
- (5 more...)
- Collection > Journal > Special Issue (0.40)
- Research Report (0.40)
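The warm-start of MCTS with an LLM task plan can be pictured concretely: pre-expand the plan's action sequence into the tree with optimistic priors, then let ordinary UCB selection take over and revise those branches. The plan, predicates, and class names below are hypothetical placeholders for the paper's actual prompt format and search implementation.

```python
# Hypothetical sketch of warm-starting tree search from an LLM task plan.
# The hard-coded plan stands in for the LLM call; all names are illustrative.
import math

class Node:
    def __init__(self, action=None, parent=None):
        self.action, self.parent = action, parent
        self.children, self.visits, self.value = {}, 0, 0.0

def warm_start(root: Node, llm_plan: list[str], prior_value: float = 1.0) -> None:
    """Pre-expand the LLM's task plan so search starts from its leaf, not from scratch."""
    node = root
    for action in llm_plan:
        child = Node(action, parent=node)
        child.visits, child.value = 1, prior_value     # optimistic prior, still revisable
        node.children[action] = child
        node = child

def select_ucb(node: Node, c: float = 1.4) -> Node:
    """Standard UCB1 selection; warm-started branches are preferred early on."""
    return max(node.children.values(),
               key=lambda n: n.value / n.visits
               + c * math.sqrt(math.log(node.visits + 1) / n.visits))

# Example plan, as might follow from a predicate-based prompt such as
# "on(box_a, table), blocking(box_b, box_a)": clear the blocker first.
root = Node()
warm_start(root, ["pick(box_b)", "place(box_b, side_table)",
                  "pick(box_a)", "place(box_a, goal)"])
root.visits = 1
print(select_ucb(root).action)   # -> "pick(box_b)"
```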